{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1b4a5007-723e-4c19-8827-e78ba1789696",
   "metadata": {},
   "source": [
    "# Understanding Categorical Similarity Space\n",
    "\n",
    "CategoricalSimilaritySpace is best used to represent categorical similarity information where there are few categories which don't have semantic names to embed them as text. The space creates an n-hot encoding of the categories."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "d584b589-4ec7-4a27-96eb-1b38680d3462",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install superlinked==3.38.0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "988afce9-46c5-4d1a-b3cc-1a76c2806dc1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "from superlinked.framework.common.schema.id_schema_object import IdField\n",
    "from superlinked.framework.common.schema.schema import schema\n",
    "from superlinked.framework.common.schema.schema_object import String\n",
    "from superlinked.framework.dsl.index.index import Index\n",
    "from superlinked.framework.dsl.space.categorical_similarity_space import (\n",
    "    CategoricalSimilaritySpace,\n",
    ")\n",
    "from superlinked.framework.dsl.query.param import Param\n",
    "\n",
    "from superlinked.framework.dsl.executor.in_memory.in_memory_executor import (\n",
    "    InMemoryExecutor,\n",
    ")\n",
    "from superlinked.framework.dsl.source.in_memory_source import InMemorySource\n",
    "from superlinked.framework.dsl.query.query import Query\n",
    "\n",
    "pd.set_option(\"display.max_colwidth\", 100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "d4f6cca4-90c8-4029-ab6a-635a20b7f236",
   "metadata": {},
   "outputs": [],
   "source": [
    "@schema\n",
    "class Product:\n",
    "    id: IdField\n",
    "    category: String\n",
    "\n",
    "\n",
    "product = Product()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1265e0e-04ca-4311-886d-f638ece94005",
   "metadata": {},
   "source": [
    "## Creating a categorical embedding\n",
    "\n",
    "Decision items:\n",
    "1. What are the `categories` that will get their own column in the [n-hot](https://stats.stackexchange.com/questions/467633/what-exactly-is-multi-hot-encoding-and-how-is-it-different-from-one-hot) encoding.<br>\n",
    "    Other will be classified as `other` and will be represented in the last column of the encoding.\n",
    "1. Should items in the other category be similar to each other? Set `uncategorized_as_category` accordingly.<br>\n",
    "    If set to `True`, all `other` items are similar to each other, while otherwise they never are (even if the same category is encoded).<br>\n",
    "    If the intention is to make a category value similar to only the same category items, it should be added to `categories`.\n",
    "1. There is a possibility to set `negative_filter` to a negative number, so non-matching categories will result in negative similarity<br>\n",
    "    (contrary to simply not contributing to similarity) therefore setting these results substantially back in the order.\n",
    "1. Do you have a multilabel setup - as in categories can have multiple categories? In that case, set `category_separator` so our space can parse <br>\n",
    "    the categories by splitting the supplied text using `category_separator`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "89412b45-7069-4970-a24f-a4c4f5eda6a7",
   "metadata": {},
   "outputs": [],
   "source": [
    "category_space_uncategorized_category = CategoricalSimilaritySpace(\n",
    "    category_input=product.category,\n",
    "    categories=[\"category-1\", \"category-2\", \"category-3\"],\n",
    "    uncategorized_as_category=True,\n",
    "    category_separator=\",\",  # setting up a separator to split multiple categories\n",
    ")\n",
    "product_index = Index(category_space_uncategorized_category)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "acedaea4-56cc-4229-a1d6-36f225c8f007",
   "metadata": {},
   "outputs": [],
   "source": [
    "source: InMemorySource = InMemorySource(product)\n",
    "executor = InMemoryExecutor(sources=[source], indices=[product_index])\n",
    "app = executor.run()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "24f2cb81-ace2-48de-bc78-1ab0fd5bffd8",
   "metadata": {},
   "outputs": [],
   "source": [
    "source.put(\n",
    "    [\n",
    "        {\"id\": \"product-1\", \"category\": \"category-1\"},\n",
    "        {\"id\": \"product-2\", \"category\": \"category-2\"},\n",
    "        {\n",
    "            \"id\": \"product-3\",\n",
    "            \"category\": \"category-2, category-3\",\n",
    "        },  # adding multilabel items using separator in a string\n",
    "        {\"id\": \"product-4\", \"category\": \"category-3\"},\n",
    "        {\"id\": \"product-5\", \"category\": \"category-4\"},\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "9dfad64f-0df4-4861-8df8-6a592e7b31c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "query_uncateg_as_categ = (\n",
    "    Query(product_index)\n",
    "    .find(product)\n",
    "    .similar(category_space_uncategorized_category.category, Param(\"query_category\"))\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a4e7169-06c5-4078-a9ed-9eed2e339370",
   "metadata": {},
   "source": [
    "Note below that multi-label instances are less similar than a single category instance to a single category query - but similar nevertheless."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "11b7efd7-b353-467a-9abb-d0b72ae52d9e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>category</th>\n",
       "      <th>id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>category-2</td>\n",
       "      <td>product-2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>category-2, category-3</td>\n",
       "      <td>product-3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>category-1</td>\n",
       "      <td>product-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>category-3</td>\n",
       "      <td>product-4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>category-4</td>\n",
       "      <td>product-5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 category         id\n",
       "0              category-2  product-2\n",
       "1  category-2, category-3  product-3\n",
       "2              category-1  product-1\n",
       "3              category-3  product-4\n",
       "4              category-4  product-5"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result_other_uncateg_as_categ = app.query(\n",
    "    query_uncateg_as_categ, query_category=\"category-2\"\n",
    ")\n",
    "result_other_uncateg_as_categ.to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b41f1be-a39d-412f-918d-f315b9fca426",
   "metadata": {},
   "source": [
    "Let's first see how the space works with `uncategorized_as_category=True`!\n",
    "\n",
    "In this case, category-4 category items are similar to other category-4 category items."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "9239867f-0269-4c71-b651-66cd0d37c040",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>category</th>\n",
       "      <th>id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>category-4</td>\n",
       "      <td>product-5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>category-1</td>\n",
       "      <td>product-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>category-2</td>\n",
       "      <td>product-2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>category-2, category-3</td>\n",
       "      <td>product-3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>category-3</td>\n",
       "      <td>product-4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 category         id\n",
       "0              category-4  product-5\n",
       "1              category-1  product-1\n",
       "2              category-2  product-2\n",
       "3  category-2, category-3  product-3\n",
       "4              category-3  product-4"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result_sunglass_uncateg_as_categ = app.query(\n",
    "    query_uncateg_as_categ, query_category=\"category-4\"\n",
    ")\n",
    "\n",
    "result_sunglass_uncateg_as_categ.to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c003836-c827-4335-ab94-65e09d699dcb",
   "metadata": {},
   "source": [
    "But every `other` category item is similar to category-4 category items, too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "45ccf109-91da-44b6-aa58-9e25c5d2ec2c",
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>category</th>\n",
       "      <th>id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>category-4</td>\n",
       "      <td>product-5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>category-1</td>\n",
       "      <td>product-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>category-2</td>\n",
       "      <td>product-2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>category-2, category-3</td>\n",
       "      <td>product-3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>category-3</td>\n",
       "      <td>product-4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 category         id\n",
       "0              category-4  product-5\n",
       "1              category-1  product-1\n",
       "2              category-2  product-2\n",
       "3  category-2, category-3  product-3\n",
       "4              category-3  product-4"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result_other_uncateg_as_categ = app.query(\n",
    "    query_uncateg_as_categ, query_category=\"any other category\"\n",
    ")\n",
    "\n",
    "result_other_uncateg_as_categ.to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7f257910-0e18-4938-bebb-cf873f0a6220",
   "metadata": {},
   "source": [
    "On the contrary, if we se `uncategorized_as_category=False`, no `other` category will be similar to each other."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "e86ef90d-7678-4d47-b6fc-bfdd80402a29",
   "metadata": {},
   "outputs": [],
   "source": [
    "category_space_no_uncategorized = CategoricalSimilaritySpace(\n",
    "    category_input=product.category,\n",
    "    categories=[\"category-1\", \"category-2\", \"category-3\"],\n",
    "    uncategorized_as_category=False,\n",
    ")\n",
    "product_index = Index(category_space_no_uncategorized)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "684f30a6-5170-4ba9-b913-0f8d8cfb8697",
   "metadata": {},
   "outputs": [],
   "source": [
    "source: InMemorySource = InMemorySource(product)\n",
    "executor = InMemoryExecutor(sources=[source], indices=[product_index])\n",
    "app = executor.run()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "bb766a61-fd0e-4a94-a4f6-078cc7ce2058",
   "metadata": {},
   "outputs": [],
   "source": [
    "source.put(\n",
    "    [\n",
    "        {\"id\": \"product-1\", \"category\": \"category-1\"},\n",
    "        {\"id\": \"product-2\", \"category\": \"category-2\"},\n",
    "        {\"id\": \"product-3\", \"category\": \"category-2\"},\n",
    "        {\"id\": \"product-4\", \"category\": \"category-3\"},\n",
    "        {\"id\": \"product-5\", \"category\": \"category-4\"},\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "118ba65c-b680-4ff9-bb59-5548f3721c1b",
   "metadata": {},
   "source": [
    "Neither category-4 to other category-4..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "b13b1887-08d5-40eb-a558-26defa06bc0b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>category</th>\n",
       "      <th>id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>category-1</td>\n",
       "      <td>product-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>category-2</td>\n",
       "      <td>product-2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>category-2</td>\n",
       "      <td>product-3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>category-3</td>\n",
       "      <td>product-4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>category-4</td>\n",
       "      <td>product-5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     category         id\n",
       "0  category-1  product-1\n",
       "1  category-2  product-2\n",
       "2  category-2  product-3\n",
       "3  category-3  product-4\n",
       "4  category-4  product-5"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query_uncateg_not_categ = (\n",
    "    Query(product_index)\n",
    "    .find(product)\n",
    "    .similar(category_space_no_uncategorized.category, Param(\"query_category\"))\n",
    ")\n",
    "result_uncateg_not_categ = app.query(\n",
    "    query_uncateg_not_categ, query_category=\"category-4\"\n",
    ")\n",
    "\n",
    "result_uncateg_not_categ.to_pandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "afd9c87c-ac6f-4b8a-9eb2-5b0d918686eb",
   "metadata": {},
   "source": [
    "...nor any `other` category to category-4."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "b0d5c60a-44af-41cf-bf9e-97bbfe5aabc9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>category</th>\n",
       "      <th>id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>category-1</td>\n",
       "      <td>product-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>category-2</td>\n",
       "      <td>product-2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>category-2</td>\n",
       "      <td>product-3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>category-3</td>\n",
       "      <td>product-4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>category-4</td>\n",
       "      <td>product-5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     category         id\n",
       "0  category-1  product-1\n",
       "1  category-2  product-2\n",
       "2  category-2  product-3\n",
       "3  category-3  product-4\n",
       "4  category-4  product-5"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result_uncateg_not_categ = app.query(\n",
    "    query_uncateg_not_categ, query_category=\"something_else\"\n",
    ")\n",
    "\n",
    "result_uncateg_not_categ.to_pandas()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "superlinked-py3.10",
   "language": "python",
   "name": "superlinked-py3.10"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
